Hi! Today we will cover NSE and its veeeery weird nature. Are you ready? Me neither. SO LET’S DO IT

Our boys

Let’s meet today’s villains: the three NSEketeers.

makeNseFunction1()
makeNseFunction2()
makeNseFunction3()

Scroll down to see el explanationi.

Each of the functions implements a different way to do a simple thing.

CREATING AN NSE VERSION OF A GIVEN FUNCTION WHICH RETRIEVES A FIELD USING NON-STANDARD EVALUATION WITH BUILT-IN R MECHANISMS OR EXTERNAL METHODS

Sounds evil?

Well it is. But it is quite easy to understand.

NSE examples

The thing is that in R you can get an element from, for example, a list like so:

myList = list(a = 1, b = 2, c = 3)
myList$a
## [1] 1

BUT. Have you ever wondered what exactly is “a” here? The one in the second line.

Let’s check it:

a

… Actually I cannot check it, because my markdown wouldn’t compile.

I would get an error saying that “a” is not found.

…well, it was not defined, so it should NOT be found.

How does it work in the first example then? Let’s call it R magic for now.
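A small peek behind the magic, as a rough sketch: the `$` operator does not evaluate what comes after it. It takes the name literally, so `myList$a` behaves like `myList[["a"]]`. The variable names `viaDollar` and `viaBrackets` are just mine for illustration.

```r
myList <- list(a = 1, b = 2, c = 3)

# `$` uses the literal name "a"; no variable called `a` is ever looked up
viaDollar <- myList$a

# The standard-evaluation equivalent: the name is an ordinary string
viaBrackets <- myList[["a"]]

identical(viaDollar, viaBrackets)
## [1] TRUE

# `$` is itself a function, and it still treats `a` as a name, not a value
`$`(myList, a)
## [1] 1
```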

Problem

Another question:

What if I wanted to pass “a” as a parameter to a function and use the $ operator inside my function like so:

myList = list(a = 1, b = 2, c = 3)
myFunction = function(a, myList){
  myList$a
}

myFunction(a, myList)
## [1] 1

WOW, it works with no problems.

well… it’s R - it only looks like it works.

Let’s see the example where I want to get “b”:

myList = list(a = 1, b = 2, c = 3)
myFunction = function(a, myList){
  myList$a
}

myFunction(b, myList)
## [1] 1

Oops.

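I am not going to go into details, but the problem is that the $ operator never evaluates what comes after it. It takes the name “a” literally, so myList$a always returns the element called “a”, no matter what we pass to the function as the argument.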

At least that’s how I understand it.
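Here is a hedged sketch of what is going on, with helper names (`ignoresArgument`, `getField`) that I made up for this illustration. The first function shows that the argument is never even looked at. The second shows the plain standard-evaluation way out: pass the field name as a string and use `[[`.

```r
myList <- list(a = 1, b = 2, c = 3)

# The argument `a` is never used: `$` looks up the literal name "a" in the list
ignoresArgument <- function(a, myList) {
  myList$a
}

# R evaluates arguments lazily, so even an undefined name causes no error here
resultIgnored <- ignoresArgument(thisNameDoesNotExist, myList)
resultIgnored
## [1] 1

# Standard-evaluation alternative: the field name travels as a string
getField <- function(fieldName, myList) {
  myList[[fieldName]]
}

resultByName <- getField("b", myList)
resultByName
## [1] 2
```

The string version works, but it is not NSE: the caller has to type the quotes.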

Fixing the problem

Long story short, to fix this error we have to do something like this:

myList = list(a = 1, b = 2, c = 3)
myFunction = function(a, myList){
  eval(substitute(a), myList)
}

myFunction(b, myList)
## [1] 2

What what what. What happened here?

In very very basic words we can say that:

“Function eval evaluates an expression, looking up the names in it in the given object (here myList)”

What about “substitute”?

It simply “retrieves” the expression that was typed as the argument (here the name b) without evaluating it, so that it can be looked up in the list in the same way as “a” is in:

myList$a
## [1] 1
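To see the two steps separately, here is a minimal sketch. `captureExpression` is a hypothetical helper that only returns what `substitute` gives back, so we can inspect it before handing it to `eval`.

```r
myList <- list(a = 1, b = 2, c = 3)

# Hypothetical helper: return the expression the caller typed, unevaluated
captureExpression <- function(a) {
  substitute(a)
}

captured <- captureExpression(b)
captured
## b
class(captured)
## [1] "name"

# eval() looks the name up in the list, so `b` resolves to myList$b
eval(captured, myList)
## [1] 2

# The same trick works for a whole expression, not just a single name
eval(captureExpression(b + c), myList)
## [1] 5
```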

If you want a proper explanation, I recommend checking these examples:

Examples

Or these explanations:

Explanation Explanation

I also recommend listening to Hadley Wickham’s 5 (actually 6) minute talk about “Tidy evaluation”. It really helped me NOT to kill myself in the process of understanding. Hope it helps you too.

Tidy evaluation in 5 mins

Back to the 3 functions

Getting back to our NSEketeers.

What you saw at the beginning are a few implementations, each of which should create an NSE (Non-Standard Evaluation) function from a non-NSE (so just SE) function.

We want this call:

min(myList$a)

to be equal to this call

min_NSE(myList, a)

**BTW they also take formulas into account**, but that would be too much to explain at once, so we’ll skip it.
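To make the goal concrete, here is a stripped-down sketch of such a factory. `makeSimpleNse` is a hypothetical name, and this version handles a bare element name only (no formulas), so it is not one of the three functions from the top of the post.

```r
# Hypothetical, simplified factory: wraps `fun` so the element can be named bare
makeSimpleNse <- function(fun) {
  function(data, element, ...) {
    # Capture the name the caller typed, look it up in `data`, then call `fun`
    fun(eval(substitute(element), data), ...)
  }
}

myList <- list(a = c(3, 1, 2))
min_NSE <- makeSimpleNse(min)

identical(min_NSE(myList, a), min(myList$a))
## [1] TRUE
```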

We are here to check their time efficiency.

I also wanted to check their memory usage but… let’s see this Stack Overflow answer. Welp.

So onto the testing we go!

Efficiency testing

First we take a look at the results of all three methods for a small dataset.

Roughly 100 random numbers from 1-100 - shouldn’t be difficult.

testedFunction = min
datasetSize = 100
dataset = list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))

Well, I was expecting one of the methods to be the worst, but come on.

Function2 is awful.

But I think I know the reason.

The problem is that it uses the function enexpr from the rlang library and has to load that library in order to work.

library(rlang)

Now, one may claim that it would be more accurate to measure the function WITHOUT loading the library, but that would clearly be unfair.

A developer will HAVE to load the library to use it, so the loading time has to be included in the total function time.
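If you want to get a feeling for how much a `library()` call inside a function costs on every call, here is a rough sketch. It is an assumption-heavy stand-in: it uses `tools`, a package that ships with R, instead of rlang, so the snippet runs without installing anything. The helper names are hypothetical and the numbers will differ from machine to machine.

```r
# `tools` ships with R; it stands in for rlang here (an assumption for this sketch)
withLibraryCall <- function() {
  library(tools)
  NULL
}

withoutLibraryCall <- function() {
  NULL
}

repetitions <- 1000

# Total elapsed seconds for `repetitions` calls of each variant
timeWith <- system.time(
  for (i in seq_len(repetitions)) withLibraryCall()
)[["elapsed"]]

timeWithout <- system.time(
  for (i in seq_len(repetitions)) withoutLibraryCall()
)[["elapsed"]]

c(withLibrary = timeWith, withoutLibrary = timeWithout)
```

One alternative, which I did not measure here, is to call the function as `rlang::enexpr(...)`, which uses the package without attaching it on every call.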

But this is not the interesting part. Let’s take a look at the next chart.

testedFunction = min
datasetSize = 10000
dataset = list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))

A slightly bigger dataset, nothing very spectacular happening. All bars went up a bit. Let’s go to the next one to check how it behaves with a different function - max().

testedFunction = max
datasetSize = 10000
dataset = list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))

Again - nothing fancy. What about mean()?

testedFunction = mean
datasetSize = 10000
dataset = list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))

Now, what about mean with a BIG dataset?

testedFunction = mean
datasetSize = 1000000
dataset = list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))

Hmmm… that’s interesting. All bars are coming closer together.

What about a REALLY BIG INPUT.

testedFunction = mean
datasetSize = 100000000
dataset = list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))

NOW THAT is interesting. One last chart before conclusion.

testedFunction = lm
datasetSize = 100
dataset = list(x = sample(x = 1:100, size = datasetSize, replace = TRUE), y = sample(x = 1:100, size = datasetSize, replace = TRUE))
formula = x ~ y

Now let me explain what happens, i.e. why the bars are getting closer and closer to each other.

It appears that the computation time of the function itself grows faster than the time needed to apply our NSE magic.

In computer science terms, this means that the **time complexity of the actual function is higher than that of the NSE transformations**.
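Put differently: if the NSE part costs a roughly fixed amount per call and the wrapped function costs more and more as the data grows, the fixed part becomes a smaller and smaller share of the total. Below is a minimal sketch of how one could check that. It assumes base R's `system.time` as the timer and uses hypothetical helpers (`makeSimpleNse`, `timeCalls`) rather than the three functions from this post, so treat the output as an illustration only.

```r
# Hypothetical, simplified NSE wrapper (bare element names only)
makeSimpleNse <- function(fun) {
  function(data, element, ...) {
    fun(eval(substitute(element), data), ...)
  }
}
mean_NSE <- makeSimpleNse(mean)

# Hypothetical helper: total elapsed seconds for repeated calls of `call`
timeCalls <- function(call, repetitions = 200) {
  system.time(for (i in seq_len(repetitions)) call())[["elapsed"]]
}

# Time the plain call and the NSE call for growing dataset sizes
timings <- do.call(rbind, lapply(c(100, 10000, 1000000), function(datasetSize) {
  dataset <- list(a = sample(x = 1:100, size = datasetSize, replace = TRUE))
  data.frame(
    datasetSize = datasetSize,
    plain = timeCalls(function() mean(dataset$a)),
    nse = timeCalls(function() mean_NSE(dataset, a))
  )
}))

timings
```

With a small dataset almost all of the NSE time is overhead. With a big one, both columns should be dominated by `mean` itself.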

Conclusion?

**It does not matter which (OF THESE) implementations you use, as long as you have a gigantic dataset** - the time is going to be comparable.

For smaller datasets though, use function3 - it’s the fastest and the simplest too!

Here are the implementations btw

makeNseFunction1 <- function(fun) {
  function(data, elementOrFormula, ...) {
    functionEnvironment = environment()

    # Bare element name (e.g. a): fetch it from data by its deparsed name
    if (as.character(substitute(elementOrFormula)) %in% names(data)) {
      argument = data[[deparse(substitute(elementOrFormula))]]
    } else {
      # Formula passed through a variable: copy every variable it mentions
      # from data into this function's environment...
      allVariables = all.vars(elementOrFormula)
      for (variable in allVariables) {
        assign(variable, eval(as.name(variable), data), envir = functionEnvironment)
      }
      # ...and make the formula look its variables up there
      argument = elementOrFormula
      environment(argument) = functionEnvironment
    }
    fun(argument, ...)
  }
}

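makeNseFunction2 <- function(fun) {
  function(data, elementOrFormula, ...) {
    # Attached on every call; this is the cost discussed in the timings above
    library(rlang)

    functionEnvironment = environment()

    # Bare element name (e.g. a): capture it and evaluate it inside data
    if (as.character(substitute(elementOrFormula)) %in% names(data)) {
      elementOrFormula = substitute(elementOrFormula)
      argument = eval(enexpr(elementOrFormula), data)
    } else {
      # Formula passed through a variable: copy its variables from data into
      # this function's environment and make the formula look there
      allVariables = all.vars(elementOrFormula)
      for (variable in allVariables) {
        assign(variable, eval(as.name(variable), data))
      }
      argument = elementOrFormula
      environment(argument) = functionEnvironment
    }
    fun(argument, ...)
  }
}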

makeNseFunction3 <- function(fun) {
  function(data, elementOrFormula, ...) {
    functionEnvironment = environment()

    # Bare element name (e.g. a): evaluate the captured name inside data
    if (as.character(substitute(elementOrFormula)) %in% names(data)) {
      argument = eval(substitute(elementOrFormula), data)
    } else {
      # Formula passed through a variable: copy its variables from data into
      # this function's environment and make the formula look there
      allVariables = all.vars(elementOrFormula)
      for (variable in allVariables) {
        assign(variable, eval(as.name(variable), data))
      }
      argument = elementOrFormula
      environment(argument) = functionEnvironment
    }
    fun(argument, ...)
  }
}
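For the curious, the formula branch (the `else` part) can be looked at in isolation. The sketch below uses a hypothetical helper, `fitWithFormula`, and a tiny made-up dataset. It copies the formula's variables into the function's environment, points the formula at that environment, and then checks that `lm` gives the same coefficients as the usual `data =` route.

```r
# Tiny made-up dataset, only for illustration
dataset <- list(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 6))
formula <- x ~ y

# Hypothetical helper mirroring the formula branch of the implementations
fitWithFormula <- function(data, modelFormula) {
  functionEnvironment <- environment()

  # Copy every variable the formula mentions from `data` into this environment
  for (variable in all.vars(modelFormula)) {
    assign(variable, data[[variable]], envir = functionEnvironment)
  }

  # Make the formula look its variables up here, so lm() needs no `data` argument
  environment(modelFormula) <- functionEnvironment
  lm(modelFormula)
}

viaEnvironment <- fitWithFormula(dataset, formula)
viaDataArgument <- lm(formula, data = dataset)

all.equal(coef(viaEnvironment), coef(viaDataArgument))
## [1] TRUE
```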